OGrep Regular Expressions

Regular expressions can be used as expressions within match-strings. They provide a more general mechanism for pattern matching than spelling out each character individually. For example, '.' matches any character, so the sequence 'a.c' matches any three-character sequence starting with 'a' and ending with 'c'. This page discusses the various pattern matching mechanisms available when regular expression matching is enabled.

Quoting

Because the pattern matching function of regular expressions uses characters to specify patterns, those characters cannot be matched directly. For example, '.' matches any character, but there may be instances when you want it to match only a period. To resolve this, the pattern matching symbols may be quoted by preceding them with the '\' character. This means that the pattern 'a\.b' matches only the sequence a.b.

Since the quote character generally means quote the next character, matching a quote character means the quote character itself has to be quoted. For example, to match the text 'a\b' one would have to write the pattern 'a\\b'.

The quote character is sometimes used to extend the working of the pattern matcher. For example, the sequence \b does not mean that a 'b' character is being quoted; it means match the beginning or end of a word.
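
The quoting rules above can be tried out in most regular expression engines. The following is a minimal sketch that uses Python's re module rather than OGrep itself, on the assumption that the two agree on these particular rules ('.', '\.', '\\' and '\b'). The sample strings are made up for illustration.

```python
import re

# An unquoted '.' matches any single character.
dot_any = re.search(r"a.c", "xxabcxx") is not None

# A quoted '.' matches only a literal period.
dot_literal_hit = re.search(r"a\.b", "a.b") is not None
dot_literal_miss = re.search(r"a\.b", "axb") is not None

# To match a literal backslash, the quote character itself is quoted.
# The Python string "a\\b" holds the three characters a, \, b.
backslash_hit = re.search(r"a\\b", "a\\b") is not None

# \b is an extension, not a quoted 'b': it matches at a word boundary.
word_hit = re.search(r"\bcat\b", "the cat sat") is not None
word_miss = re.search(r"\bcat\b", "concatenate") is not None

print(dot_any, dot_literal_hit, dot_literal_miss,
      backslash_hit, word_hit, word_miss)
```

Only the second and last checks fail to match: 'axb' has no period, and 'concatenate' contains 'cat' only in the middle of a word.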

Basic Pattern matching

The following symbols match various types of patterns:

Range Matching

Some of the basic pattern matching symbols, such as '+', can match multiple occurrences of a character. Range matching is a more general statement of the number of occurrences of a character. For example, you can match a bounded range, say from two to four 'a' characters, by doing the following:

OGrep "a\{2,4\}"  *.c
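
To see how a bounded range behaves on runs of different lengths, here is a small sketch in Python's re module, used as a stand-in for OGrep. Note one dialect difference: Python writes the range with unquoted braces, a{2,4}, where the OGrep pattern above quotes them. The test strings are invented for the example.

```python
import re

# Python spelling of the bounded range: two to four 'a' characters.
pattern = re.compile(r"a{2,4}")

one = pattern.search("bab")        # a single 'a' is too short
three = pattern.search("baaab")    # three 'a's fall inside the range
six = pattern.search("baaaaaab")   # six 'a's: the match stops after four

print(one)
print(three.group(0))
print(six.group(0))
```

A run longer than the upper bound still produces a match, because the search is satisfied by the first four characters of the run.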

Matching set members

Brackets [] can be used to delimit a set of characters. The bracketed sequence then matches any single character in the set. For example:

OGrep "[abc]" *.c

matches any of the characters a, b, or c. A range of characters can be specified:

OGrep "[a-m]"  *.c

matches any characters in the range a-m.

Set negation is possible:

OGrep "[^a-z]" *.c

matches anything but a lowercase letter.

Sets can be more complex:

OGrep "[A-Za-z0-9]" *.c

matches any alphanumeric value.

Sets can be combined with more basic pattern matching:

OGrep "[A-Z]?" *.c

matches zero or one upper-case character.

OGrep "[0-9]\{2,4\}" *.c

matches from two to four digits.
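
The set patterns above can be checked side by side with a short sketch. It uses Python's re module as a stand-in for OGrep, assuming the bracket syntax behaves the same in both; the helper first_match and the sample strings are hypothetical, and the range braces are unquoted in Python.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

member = first_match(r"[abc]", "xyzb")          # first of a, b, c
in_range = first_match(r"[a-m]", "zzk")         # first letter in a-m
negated = first_match(r"[^a-z]", "abcDef")      # first non-lowercase character
alnum = first_match(r"[A-Za-z0-9]", "--7--")    # first alphanumeric character
digits = first_match(r"[0-9]{2,4}", "id 12345") # at most four digits are taken
optional = first_match(r"[A-Z]?", "abc")        # zero occurrences also match

print(member, in_range, negated, alnum, digits, repr(optional))
```

The last case returns an empty string rather than None: a pattern that allows zero occurrences matches at the start of any text, so on its own it does not filter anything out.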

Region matching

Sometimes, it is desirable to match the same sub-pattern multiple times before having grep declare the pattern as a match for the text. In a simple case:

OGrep "\(myword\)|\0" *.c

matches the string:

myword|myword

In this case the quoted parentheses surround the region to match, and the \0 says to match that region again. This is not a very interesting case, but it becomes more powerful when combined with other pattern matching, because the \0 doesn't reapply the pattern; instead it matches exactly the text that the region matched. For example, to combine it with a set:

OGrep "\([a-z]+\)||\0" *.c

matches any lower-case word twice, as long as the two copies are separated by two | characters. This pattern would match 'ab||ab' but not 'ab||xy'.

Up to ten such regions may be specified in the pattern; to access them use \0 \1 \2 ... \9 where the digit gives the order the region is encountered within the pattern.
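
The difference between repeating a region and reapplying a pattern can be sketched in Python's re module, used here as a stand-in for OGrep. Three dialect differences are assumptions to keep in mind: Python leaves the parentheses unquoted, numbers its regions from \1 rather than \0, and treats an unquoted '|' as alternation, so a literal '|' is written '\|'. In the second pattern the set is followed by '+' so that the region can span a whole word.

```python
import re

# A fixed word in a region, a literal '|', then the region again.
simple = re.compile(r"(myword)\|\1")

# A lower-case word in a region, two literal '|', then the same text again.
repeated = re.compile(r"([a-z]+)\|\|\1")

same_word = simple.fullmatch("myword|myword")
same_text = repeated.fullmatch("ab||ab")
different_text = repeated.fullmatch("ab||xy")

print(same_word is not None)
print(same_text.group(1))
print(different_text)
```

The last test fails even though 'xy' would satisfy the set on its own, because the back-reference demands the exact text captured by the region.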